_1. Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent._ Let us take equation (4.2):

$$p(X) = \frac{ e^{\beta_0 + \beta_1x} }{ 1 + e^{\beta_0 + \beta_1x} }$$

Now divide both sides by $1 - p(X)$: $$ \frac{p(X)}{1-p(X)} = \frac{ e^{\beta_0 + \beta_1x} }{ 1 + e^{\beta_0 + \beta_1x} } \cdot \frac{1}{1-p(X)} $$

Substitute the value of $ p(X) $: $$ \frac{p(X)}{1-p(X)} = \frac{ e^{\beta_0 + \beta_1x} }{ 1 + e^{\beta_0 + \beta_1x} } \cdot \frac{1}{1-\frac{ e^{\beta_0 + \beta_1x} }{ 1 + e^{\beta_0 + \beta_1x} }} $$ Since $1-\frac{ e^{\beta_0 + \beta_1x} }{ 1 + e^{\beta_0 + \beta_1x} } = \frac{1}{ 1 + e^{\beta_0 + \beta_1x} }$, this simplifies to

$$ \frac{p(X)}{1-p(X)} = \frac{ e^{\beta_0 + \beta_1x} }{ 1 + e^{\beta_0 + \beta_1x} } \cdot \frac{ 1 + e^{\beta_0 + \beta_1x}}{1} $$

Finally, $$ \frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1x} $$

This is equation (4.3).
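
As a quick numerical sanity check (a minimal sketch with arbitrary, made-up coefficients, not part of the derivation), the odds computed from the logistic form (4.2) match $e^{\beta_0 + \beta_1 x}$ from the logit form (4.3):

```python
import numpy as np

# Hypothetical coefficients chosen only for the check; any values work.
beta_0, beta_1 = -2.0, 0.7
x = np.linspace(-5, 5, 11)

# Equation (4.2): logistic form
p = np.exp(beta_0 + beta_1 * x) / (1 + np.exp(beta_0 + beta_1 * x))

# Odds computed from p(X) vs. the logit form (4.3)
odds_from_p = p / (1 - p)
odds_direct = np.exp(beta_0 + beta_1 * x)

print(np.allclose(odds_from_p, odds_direct))  # True
```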


2. It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the kth class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes’ classifier assigns an observation to the class for which the discriminant function is maximized.

With one covariate and $K$ classes, LDA is exactly Bayes classification with a few restrictions:

  • The class-conditional density $f_k(x)$ is assumed to be Gaussian, i.e. the predictor is normally distributed within every class $k$.
  • The variance of $x$ in each class is assumed to be the same.
  • The mean values of the predictors in each class are distinct.

The idea is to find the value of $k$ that maximizes $p_k(X)$. This $k$ then corresponds to the class we assign the observation to. Mathematically, we need to find:

$$\underset{k}{arg max}\hspace{2mm} p_k(x)$$

This is the value of k that maximizes $p_k(x)$. This is the same as:

$$\underset{k}{arg max}\hspace{2mm} \frac{\pi_k \cdot f_k(x)}{ \sum_{l=1}^{K}\pi_l \cdot f_l(x) } $$

The denominator $\sum_{l=1}^{K}\pi_l \cdot f_l(x)$ is the same for every class, and the normalizing constant $\frac{1}{\sqrt{2 \pi \sigma^2}}$ in each Gaussian density does not depend on $k$ because all classes share the same variance. Discarding both, we have

$$ \underset{k}{arg max}\hspace{2mm} \pi_k \cdot e^{-\frac{(x-\mu_k)^2}{2\sigma^2}} $$

Since the logarithm is a monotonically increasing function, maximizing this quantity is equivalent to maximizing its logarithm. Taking the logarithm,

$$ \underset{k}{arg max}\hspace{2mm} log_e(\pi_k) -\frac{(x-\mu_k)^2}{2\sigma^2} $$

Expanding the squared term:

$$ \underset{k}{arg max}\hspace{2mm} log_e(\pi_k) -\frac{x^2- 2x\mu_k + \mu_k^2}{2\sigma^2} $$$$ \underset{k}{arg max}\hspace{2mm} log_e(\pi_k) -\frac{x^2}{2\sigma^2} + \frac{2x\mu_k}{2\sigma^2} - \frac{\mu_k^2}{2\sigma^2} $$$$ \underset{k}{arg max}\hspace{2mm} log_e(\pi_k) -\frac{x^2}{2\sigma^2} + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} $$

But $\frac{x^2}{2\sigma^2} $ doesn’t depend on $k$. So we just need to find the following:

$$ \underset{k}{arg max}\hspace{2mm} log_e(\pi_k) + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} $$

Which gives us

$$ \underset{k}{arg max}\hspace{2mm} \delta_k(x) $$

Where $\delta_k(x)$ is called the linear discriminant function: $$ \delta_k(x) = log_e(\pi_k) + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} $$

We have thus reduced the problem from maximizing $p_k(x)$ to maximizing $\delta_k(x)$, which is linear in $x$.
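
The equivalence is easy to verify numerically. The sketch below (with made-up means, priors and a shared variance) checks that, over a grid of test points, the class maximizing the posterior numerator $\pi_k f_k(x)$ is always the class maximizing $\delta_k(x)$:

```python
import numpy as np

# Hypothetical parameters for a 1-D, two-class problem with shared variance.
mu = np.array([0.0, 2.0])      # class means mu_k
pi = np.array([0.3, 0.7])      # class priors pi_k
sigma2 = 1.5                   # common variance sigma^2

x = np.linspace(-3, 5, 201)[:, None]   # column of test points

# Posterior numerators pi_k * f_k(x); the shared denominator does not affect argmax.
f = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
posterior_argmax = np.argmax(pi * f, axis=1)

# Linear discriminant delta_k(x) = log(pi_k) + x*mu_k/sigma^2 - mu_k^2/(2*sigma^2)
delta = np.log(pi) + x * mu / sigma2 - mu ** 2 / (2 * sigma2)
discriminant_argmax = np.argmax(delta, axis=1)

print(np.array_equal(posterior_argmax, discriminant_argmax))  # True
```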


3. This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where p = 1; i.e. there is only one feature.

Suppose that we have K classes, and that if an observation belongs to the kth class then X comes from a one-dimensional normal distribution, $X \sim N(\mu_k, \sigma_k^2)$. Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes’ classifier is not linear. Argue that it is in fact quadratic. Answer:

For $p \geq 1$ covariates and multiple classes, QDA is exactly Bayes classification with a few restrictions:

  • The class-conditional density $f_k(x)$ is assumed to be Gaussian, i.e. it has a (multivariate) normal distribution for every class $k$.
  • The covariance matrix of x in each class is assumed to be distinct.
  • The mean values of the predictors in each class are distinct.

We will first prove QDA is quadratic in X under these general constraints and introduce the number of predictors $p=1$ later on.

Consider the following variables:

  • $\mu_k$ : $p \times 1 $ Vector of means of $p$ covariates for class $k$
  • $\Sigma_k$ : $p \times p $ Covariance Matrix for class $k$
  • $\pi_k$ : the (scalar) prior probability that an observation belongs to class $k$.

As in the derivation of the linear discriminant function for LDA, we have the same goal of finding the value of $k$ that maximizes the probability $p_k(X)$. Mathematically, we need to find:

$$\underset{k}{arg max}\hspace{2mm} p_k(x)$$

This is the value of k that maximizes $p_k(x)$. This is the same as:

$$\underset{k}{arg max}\hspace{2mm} \frac{\pi_k \cdot f_k(x)}{ \sum_{l=1}^{K}\pi_l \cdot f_l(x) } $$

But the denominator doesn’t depend on the class $k$ and so the maximum value of $p_k(x)$ is proportional to the numerator. We now need to just find:

$$\underset{k}{arg max}\hspace{2mm} \pi_k \cdot f_k(x)$$

QDA assumes Gaussian distribution of the predictor x for every class.

$$ f_k(x) = \frac{1}{(2 \pi)^{p/2} \cdot |\Sigma_k|^{1/2}} \cdot e^{-\frac{(x-\mu_k)^T \cdot \Sigma_k^{-1} \cdot (x-\mu_k)}{2}} $$

So we now have to find

$$ \underset{k}{arg max}\hspace{2mm} \pi_k \cdot \frac{1}{(2 \pi)^{p/2} \cdot |\Sigma_k|^{1/2}} \cdot e^{-\frac{(x-\mu_k)^T \cdot \Sigma_k^{-1} \cdot (x-\mu_k)}{2}} $$

But $ (2 \pi)^{p/2} $ does not depend on $ k $. We need to find:

$$ \underset{k}{arg max}\hspace{2mm} \pi_k \cdot |\Sigma_k|^{-\frac{1}{2}} \cdot e^{-\frac{(x-\mu_k)^T \cdot \Sigma_k^{-1} \cdot (x-\mu_k)}{2}} $$

Since the logarithm is monotonically increasing, maximizing this quantity is equivalent to maximizing its logarithm:

$$ \underset{k}{arg max}\hspace{2mm} log(\pi_k) - \frac{1}{2}log|\Sigma_k| -\frac{(x-\mu_k)^T \cdot \Sigma_k^{-1} \cdot (x-\mu_k)}{2} $$

Expanding:

$$ \underset{k}{arg max}\hspace{2mm} log(\pi_k) - \frac{1}{2}log|\Sigma_k| -\frac{x^T \Sigma_k^{-1} x - \mu_k^T \Sigma_k^{-1} x - x^T \Sigma_k^{-1} \mu_k + \mu_k^T \Sigma_k^{-1} \mu_k }{2} $$

Simplifying, using the fact that $\mu_k^T \Sigma_k^{-1} x = x^T \Sigma_k^{-1} \mu_k$ (it is a scalar, and $\Sigma_k^{-1}$ is symmetric):

$$ \underset{k}{arg max}\hspace{2mm} log(\pi_k) - \frac{1}{2}log|\Sigma_k| -\frac{x^T \Sigma_k^{-1} x}{2} + x^T \Sigma_k^{-1} \mu_k - \frac{\mu_k^T \Sigma_k^{-1} \mu_k}{2} $$

Or we say that we want to maximize the Quadratic Discriminant Function $ \delta_k(x)$:

$$ \underset{k}{arg max}\hspace{2mm} \delta_k(x) $$

Where $$ \delta_k(x) = log(\pi_k) - \frac{1}{2}log|\Sigma_k| -\frac{x^T \Sigma_k^{-1} x}{2} + x^T \Sigma_k^{-1} \mu_k - \frac{\mu_k^T \Sigma_k^{-1} \mu_k}{2} $$

The question states that we have only one predictor, $p = 1$. This reduces $\Sigma_k$ (now just the class variance $\sigma_k^2$) and $\mu_k$ to scalars, while $\pi_k$ is already a scalar prior. The form above then reduces to the following:

$$ \delta_k(x) = log(\pi_k) - \frac{1}{2}log(\Sigma_k) -\frac{x^2 \Sigma_k^{-1}}{2} + x \Sigma_k^{-1} \mu_k - \frac{\mu_k^2 \Sigma_k^{-1}}{2} $$

The boundary between 2 classes $ k $ and $j$ will be given by the following: $$ \delta_k(x) = \delta_j(x) $$

$$ log(\pi_k) - \frac{1}{2}log(\Sigma_k) -\frac{x^2 \Sigma_k^{-1}}{2} + x \Sigma_k^{-1} \mu_k - \frac{\mu_k^2 \Sigma_k^{-1}}{2} = log(\pi_j) - \frac{1}{2}log(\Sigma_j) -\frac{x^2 \Sigma_j^{-1}}{2} + x \Sigma_j^{-1} \mu_j - \frac{\mu_j^2 \Sigma_j^{-1}}{2} $$$$ log \Bigg(\frac{\pi_k}{\pi_j}\Bigg) - \frac{1}{2}log\Bigg(\frac{\Sigma_k}{\Sigma_j}\Bigg) - x^2\frac{\Sigma_k^{-1} - \Sigma_j^{-1}}{2} + x (\Sigma_k^{-1} \mu_k -\Sigma_j^{-1} \mu_j) - \frac{(\mu_k^2 \Sigma_k^{-1} - \mu_j^2 \Sigma_j^{-1})}{2} = 0 $$

Because the classes have distinct variances, the $x^2$ term does not cancel, so the decision boundary is quadratic in $x$; the Bayes classifier under these assumptions is therefore not linear.
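
As a numerical check (a sketch with made-up class parameters), we can build the quadratic $\delta_k(x) - \delta_j(x) = 0$ from the coefficients above, find its roots, and confirm that the two classes' posterior numerators $\pi_k f_k(x)$ and $\pi_j f_j(x)$ are equal there:

```python
import numpy as np

# Hypothetical 1-D parameters with distinct class variances.
mu_k, mu_j = 0.0, 3.0
s2_k, s2_j = 1.0, 4.0          # sigma_k^2, sigma_j^2
pi_k, pi_j = 0.4, 0.6

# Coefficients of a*x^2 + b*x + c = 0 from delta_k(x) - delta_j(x) = 0
a = -0.5 * (1 / s2_k - 1 / s2_j)
b = mu_k / s2_k - mu_j / s2_j
c = (np.log(pi_k / pi_j) - 0.5 * np.log(s2_k / s2_j)
     - 0.5 * (mu_k ** 2 / s2_k - mu_j ** 2 / s2_j))

roots = np.roots([a, b, c])

def post_num(x, mu, s2, pi):
    """Posterior numerator pi_k * f_k(x) for a 1-D Gaussian class density."""
    return pi * np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

for r in roots:
    print(r, np.isclose(post_num(r, mu_k, s2_k, pi_k),
                        post_num(r, mu_j, s2_j, pi_j)))  # equal at both roots
```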


4. When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large. We will now investigate this curse.

  • (a) Suppose that we have a set of observations, each with measurements on p = 1 feature, X. We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10 % of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

Answer: Every test value $X \in [0.05, 0.95]$ uses observations in the range $[X-0.05, X+0.05]$. Test values outside this interval also use the closest 10% of the range: for $X \in [0, 0.05)$ that is $[0, 0.1]$, and for $X \in (0.95, 1]$ it is $[0.9, 1]$. In every case the interval has length 0.1, so on average the fraction of observations used for prediction is 10%.

  • (b) Now suppose that we have a set of observations, each with measurements on p = 2 features, X1 and X2. We assume that (X1 , X2 ) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation’s response using only observations that are within 10 % of the range of X1 and within 10 % of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?

Answer: Say we have 100 samples. For $p=1$, as in (a), they would lie along a line segment; now the same 100 samples are spread uniformly over the unit square $[0,1] \times [0,1]$, so they are much more spread out than before. Because the data points are uniformly distributed, the expected number of points falling inside a fixed region is proportional to the area of that region:

$$ \frac{\textit{Number of Datapoints covered in Region}}{\textit{Number of Datapoints in Plane}} = \frac{\textit{Area of Region}}{\textit{Area of Plane}} $$$$ \frac{n}{100} = \frac{0.1 \times 0.1}{1 \times 1} $$$$ n = 1 $$

With 100 datapoints in the plane, we would expect to find only 1 point (1% of the points) inside the region covering 10% of the range of $X_1$ and 10% of the range of $X_2$.

  • (c) Now suppose that we have a set of observations on p = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10 % of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

We will use the same relation as in (b):

$$ \frac{\textit{Number of Datapoints covered in Region}}{\textit{Number of Datapoints in 100-dimensional hypercube}} = \frac{\textit{Volume of Region}}{\textit{Volume of 100-dimensional hypercube}} $$$$ \frac{n}{100} = \frac{0.1^{100}}{1^{100}} $$$$ n = 10^{-98} $$

With 100 datapoints in the hypercube, we would expect to find only $10^{-98}$ points ($10^{-98}\%$ of the points) on average inside the region covering 10% of the range of each of $X_1, X_2, ..., X_{100}$. In practice, we would almost certainly not find a single datapoint.

  • (d) Using your answers to parts (a)–(c), argue that a drawback of KNN when p is large is that there are very few training observations “near” any given test observation.

From the answers to (a) through (c), a pattern emerges. In general, the number of observations $n$ (out of 100) that lie within 10% of the range of every predictor is:

$$ n = \textit{Fraction of Datapoints covered in Region} \times 100 $$$$ n = 0.1^{p} \times 100 $$$$ n = 10^{-p} \times 100 $$$$ n = 10^{2-p} $$

Substituting $p$ = 1, 2 and 100 reproduces cases (a), (b) and (c) respectively. As the number of predictors $p$ increases, the number of points within the 10% range of every predictor decreases exponentially, so there are very few training observations "near" any given test observation, which is exactly what KNN relies on to make a prediction.
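
A short sketch (plain arithmetic, nothing beyond the formula above) makes the exponential decay concrete:

```python
# Fraction of observations inside the "closest 10% of each feature's range"
# neighbourhood, as a function of the number of features p.
for p in (1, 2, 10, 100):
    fraction = 0.1 ** p
    print(f"p = {p:3d}: fraction used = {fraction:.3g} "
          f"({fraction * 100:.3g} points out of 100)")
```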

  • (e) Now suppose that we wish to make a prediction for a test observation by creating a p-dimensional hypercube centered around the test observation that contains, on average, 10 % of the training observations. For p = 1,2, and 100, what is the length of each side of the hypercube? Comment on your answer.

We use the same relation as before, now with the observation that, because of the uniform distribution, the expected number of data points in a region is directly proportional to the region's volume:

$$ \frac{\textit{Number of Datapoints in p dimensional hypercube}}{\textit{Number of Datapoints in Data Space}} = \frac{\textit{Volume of p dimensional hypercube}}{\textit{Volume of Data Space}} $$

The Number of Datapoints in p dimensional hypercube is 10% of the total number of datapoints. So

$$ 0.1 = \frac{\textit{Volume of p dimensional hypercube}}{\textit{Volume of Data Space}} $$$$ 0.1 = \Bigg( \frac{\textit{Side Length of p dimensional hypercube}}{\textit{Side Length of Data Space}} \Bigg)^p $$

Take power of $\frac{1}{p}$ on both sides:

$$ 0.1^\frac{1}{p} = \frac{\textit{Side Length of p dimensional hypercube}}{\textit{Side Length of Data Space}} $$

Thus, we have

$$ \textit{Side Length of p dimensional hypercube} = 0.1^\frac{1}{p} \times \textit{Side Length of Data Space} $$

Now that we have the general formula, let us take the data space to have unit side length (and hence unit volume).

  • Case 1: p = 1 : $\textit{Side Length of p dimensional hypercube} = 0.1^\frac{1}{1} = \textbf{0.1} \hspace{2mm} units $
  • Case 2: p = 2 : $\textit{Side Length of p dimensional hypercube} = 0.1^\frac{1}{2} = \textbf{0.316} \hspace{2mm} units $
  • Case 3: p = 100 : $\textit{Side Length of p dimensional hypercube} = 0.1^\frac{1}{100} = \textbf{0.977} \hspace{2mm} units $

Comment: to capture just 10% of the observations, the hypercube for $p = 100$ must span nearly 98% of the range of every single feature. Such a "neighborhood" is no longer local in any meaningful sense, which is another way of seeing the curse of dimensionality.
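
The same numbers drop straight out of the formula (a minimal sketch):

```python
# Side length of the hypercube containing, on average, 10% of uniformly
# distributed observations, for a data space with unit side length.
for p in (1, 2, 100):
    print(f"p = {p:3d}: side length = {0.1 ** (1 / p):.3f}")
```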

5. We now examine the differences between LDA and QDA.

  • (a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

A linear Bayes decision boundary means the best possible decision boundary is linear. On the test set, LDA is expected to perform better because it produces exactly such linear boundaries. On the training set, however, QDA creates a much more flexible boundary and will tend to overfit the training data, so QDA's training accuracy will be better than LDA's (while its test accuracy will be worse).

  • (b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

QDA is expected to have the better test set accuracy, because LDA can only produce a linear boundary and will underfit a non-linear Bayes boundary. The same is true for the training set accuracy.

  • (c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

As the sample size $n$ increases, the test prediction accuracy of QDA relative to LDA is expected to improve. As mentioned in (a), QDA's flexibility makes it prone to overfitting, i.e. a high-variance problem; that variance shrinks as the number of training observations grows, while LDA's potential bias from forcing a linear boundary does not. With more data, QDA therefore becomes relatively more attractive.

  • (d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

False. As discussed in (a), when the true decision boundary is linear, QDA's extra flexibility causes it to fit noise in the training data (high variance) without any offsetting reduction in bias. Unless the sample size is very large, LDA will therefore achieve a lower test error rate than QDA.


_6. Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and $Y$ = receive an A. We fit a logistic regression and produce estimated coefficients_ $\hat{\beta_0}$ =−6, $\hat{\beta_1}$ = 0.05, $\hat{\beta_2}$ = 1.

  • (a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.
  • (b) How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

Answer: (a) We have the following information:

  • $ X_1 = 40 $
  • $ X_2 = 3.5 $
  • Goal : Find $ Pr(Y= Gets A \space | \space X) $ or $p_A(X)$
$$p_A(X) = \frac{ e^{\beta_0 + \beta_1X_1 + \beta_2X_2} }{ 1 + e^{\beta_0 + \beta_1X_1 + \beta_2X_2} }$$

Substituting the known values

$$p_A(X) = \frac{ e^{-6 + 0.05 \cdot 40 + 1 \cdot 3.5} }{ 1 + e^{-6 + 0.05 \cdot 40 + 1 \cdot 3.5} }$$$$p_A(X) = \frac{ e^{-0.5} }{ 1 + e^{-0.5} }$$$$p_A(X) = 0.3775$$

So the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class is 37.75%.

(b) We are given the following information:

  • $ Pr(Y= Gets A \space | \space X) $ or $p_A(X) = 0.5 $
  • $ X_2 = 3.5 $
  • Goal : Find $X_1$

It is better to consider the logit (log odds) form of logistic regression:

$$ log\Bigg(\frac{p_A(X)}{1-p_A(X)}\Bigg) = \beta_0 + \beta_1X_1 + \beta_2X_2 $$

Substituting values,

$$ log\Bigg(\frac{0.5}{1-0.5}\Bigg) = -6 + 0.05 \cdot X_1 + 1 \cdot 3.5 $$

Or

$$ -6 + 0.05 \cdot X_1 + 1 \cdot 3.5 = 0$$

Solving for $X_1$

$$ X_1 = \frac{6 - 3.5}{0.05}$$$$ X_1 = 50 $$

The student needs to study for 50 hours.
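
Both parts can be reproduced in a few lines (a sketch using the coefficients given in the question):

```python
import numpy as np

# Given coefficients
b0, b1, b2 = -6.0, 0.05, 1.0

# (a) probability of an A for X1 = 40 hours studied, X2 = 3.5 GPA
z = b0 + b1 * 40 + b2 * 3.5
p_a = np.exp(z) / (1 + np.exp(z))
print(round(p_a, 4))          # 0.3775

# (b) hours needed for a 50% chance: logit(0.5) = 0 = b0 + b1*X1 + b2*3.5
x1 = (0 - b0 - b2 * 3.5) / b1
print(x1)                     # 50.0
```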


7. Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was $\bar{X} = 10$, while the mean for those that didn’t was $\bar{X} = 0$. In addition, the variance of X for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally, 80 % of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year.

Answer. This is a binary classification problem with one predictor (last year's percent profit $X$), class-specific means and a common variance. We can model this problem using LDA with the following data: $$ \mu_y = 10 , \mu_n = 0 \\ \pi_y = 0.8 , \pi_n = 0.2 \\ \sigma_y^2 = \sigma_n^2 = \sigma^2 = 36 \\ Pr ( y="Yes"| X=4 ) = p_y(4) = ?\\ $$

For LDA, we have:

$$ p_y(x) = \frac{\pi_y \cdot f_y(x)}{ \sum_{l=1}^{K}\pi_l \cdot f_l(x) } $$

Where $K = 2$ is the number of classes. Substituting the values for a gaussian density function:

$$ p_y(x) = \frac{\pi_y \cdot e^{-\frac{(x-\mu_y)^2}{2\sigma^2}}}{ \pi_n \cdot e^{-\frac{(x-\mu_n)^2}{2\sigma^2}} + \pi_y \cdot e^{-\frac{(x-\mu_y)^2}{2\sigma^2}} } $$

Substituting all our given values:

$$ p_y(4) = \frac{0.8 \cdot e^{-\frac{(4-10)^2}{2 \times 36}}}{ 0.2 \cdot e^{-\frac{(4-0)^2}{2\times 36}} + 0.8 \cdot e^{-\frac{(4-10)^2}{2\times 36}} } $$$$ p_y(4) = \frac{0.8 \cdot e^{-\frac{1}{2}}}{ 0.2 \cdot e^{-\frac{2}{9}} + 0.8 \cdot e^{-\frac{1}{2}} } $$$$ p_y(4) = 0.75185 $$

There is about a 75.2% chance the company will issue a dividend this year.
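
A quick sketch reproducing the calculation with the given parameters (the constant $\frac{1}{\sqrt{2\pi\sigma^2}}$ is common to both classes and cancels, so it is omitted):

```python
import numpy as np

# Parameters given in the question
mu_yes, mu_no = 10.0, 0.0
pi_yes, pi_no = 0.8, 0.2
sigma2 = 36.0
x = 4.0

def numerator(pi, mu):
    # pi_k * f_k(x) up to the common factor 1/sqrt(2*pi*sigma^2)
    return pi * np.exp(-(x - mu) ** 2 / (2 * sigma2))

p_yes = numerator(pi_yes, mu_yes) / (numerator(pi_yes, mu_yes) + numerator(pi_no, mu_no))
print(round(p_yes, 5))        # 0.75185
```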


8. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20 % on the training data and 30 % on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

Answer. The goal is to compare the performance of 2 classifiers:

  • Logistic Regression: Train Error 20%; Test error 30%
  • K Nearest Neighbors with K=1 : Average Train and Test Error = 18%.

Performance is evaluated based on Test data alone. Let us compute the test error for the KNN version. We are given that the average train and test error is 18%.

$$ \frac{\textit{Train Error} + \textit{Test Error}}{2} = 18\% $$

The train error for K=1 is 0%. This is because every data point is closest to itself (with a distance of 0), so every point in the training set will be labeled correctly. Substituting 0 into the equation above, we end up with: $$ \textit{Test Error} = 36\% $$

This test error is greater than the 30% in Logistic Regression. Thus, Logistic Regression is the preferred method for the classification of new observations.
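
To illustrate the key step, that 1-NN always attains zero training error, here is a small sketch on random, made-up data (it assumes scikit-learn is installed; when scoring on the training set, each point's nearest neighbor is itself):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random, made-up data: the labels are pure noise, so nothing can really be
# "learned", yet 1-NN still scores 100% on the training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.score(X, y))        # 1.0, i.e. 0% training error
```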


9. This problem has to do with odds. (a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default? (b) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

Answer:

Let $p(X)$ be the probability that a person will default on their credit card.

(a) The odds are given as 0.37. In other words,

$$ \frac{p(X)}{1-p(X)}= 0.37 $$

Solving for $p(X)$

$$ p(X) = 0.37 - 0.37 \cdot p(X) $$$$ p(X) = \frac{0.37}{1.37} $$$$ p(X) = 0.27 $$

So about 27% of people will default on their credit card payment.

(b) Given $p(X) = 0.16$. We need to find the odds:

$$ odds = \frac{p(X)}{1-p(X)} $$$$ odds = \frac{0.16}{0.84} \approx 0.19 $$

The odds that a person will default are approximately 0.19.
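
Both conversions in a couple of lines (a minimal sketch):

```python
# (a) odds -> probability
odds = 0.37
p = odds / (1 + odds)
print(round(p, 3))            # 0.27

# (b) probability -> odds
p = 0.16
odds = p / (1 - p)
print(round(odds, 3))         # 0.19
```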